library(ggplot2)
library(plotly)
library(MASS)
library(pls)
library(leaps)
set.seed(2112)
knitr::opts_chunk$set(
warning = FALSE,
message = FALSE
)
options(browser = 'chromium')
We are to investigate the effects of noise on PCA, using the iris dataset.
noisy_pca = function(X, labels, noise_n, noise_sd) {
  # Append noise_n columns of N(0, noise_sd^2) noise to X, run PCA,
  # and plot the first two principal components coloured by labels.
  n_obs = nrow(X)
  noise = matrix(rnorm(noise_n * n_obs, 0, noise_sd), nrow = n_obs, ncol = noise_n)
  pX = cbind(X, noise)
  pca = prcomp(pX)
  ggplot() +
    geom_point(aes(x = pca$x[, 1], y = pca$x[, 2], color = labels))
}
First, iris without noise:
noisy_pca(iris[,1:4], iris[,5], 0, 1)
Then, iris with 100 columns worth of \(\mathcal{N}(0, 1)\) noise:
noisy_pca(iris[,1:4], iris[,5], 100, 1)
We can see that the noisy columns have made the separation between the different types far less clear.
Next, iris with 100 columns of \(\mathcal{N}(0, 0.1^2)\) noise:
noisy_pca(iris[,1:4], iris[,5], 100, 0.1)
and with \(\mathcal{N}(0, 4^2)\) noise:
noisy_pca(iris[,1:4], iris[,5], 100, 4)
For \(\sigma = 0.1\) the change is fairly small, whereas for \(\sigma = 4\) the change is significant. Let’s look at the variances of the original columns:
apply(iris[,1:4], 2, var)
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## 0.6856935 0.1899794 3.1162779 0.5810063
We can see that adding “features” with \(\sigma = 4\) necessarily distorts PCA: each such column has variance \(16\), far larger than any original column, and PCA projects onto the directions with largest variance.
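To make the comparison concrete, we can put the variance of a single noise column next to the largest variance among the original features (a quick check using base R only):

```r
# Each appended noise column has variance noise_sd^2; compare with
# the largest variance among the original iris features:
max_var = max(apply(iris[, 1:4], 2, var))  # ~3.12 (Petal.Length)
c(sigma_0.1 = 0.1^2, max_original = max_var, sigma_4 = 4^2)
```

With \(\sigma = 4\) every noise column has variance \(16\), dwarfing Petal.Length’s \(3.12\), so the leading principal directions chase the noise; with \(\sigma = 0.1\) the noise variance of \(0.01\) is negligible.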
In this task we investigate the effect of outliers on PCA.
First, we want to plot a sample from \((4X, 2Y, Z)\) where \(X, Y, Z \sim \mathcal{N}(0, 1)\) are independent (an axis-aligned ellipsoidal point cloud):
n = 100
# n draws from (4X, 2Y, Z) with X, Y, Z ~ N(0, 1)
p = matrix(rnorm(3 * n), nrow = n, ncol = 3)
p_df = as.data.frame(p)
colnames(p_df) = c("X", "Y", "Z")
p_df[, "X"] = 4 * p_df[, "X"]
p_df[, "Y"] = 2 * p_df[, "Y"]
fig = plot_ly(p_df, x = ~X, y = ~Y, z = ~Z)
fig = fig %>% add_markers()
fig = fig %>% layout(
scene=list(
yaxis = list(scaleanchor="x"),
zaxis = list(scaleanchor="x")
)
)
fig
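As a sanity check on the scaling (reusing `p_df` from the chunk above), the empirical standard deviations should come out near \((4, 2, 1)\), up to sampling noise:

```r
# Empirical standard deviations of the three coordinates; with n = 100
# these should be roughly (4, 2, 1)
apply(p_df, 2, sd)
```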
Let’s look at the directions:
prcomp(p_df)$rotation
## PC1 PC2 PC3
## X 0.99962726 0.0206555 -0.01785165
## Y -0.02279905 0.9912948 -0.12967161
## Z 0.01501782 0.1300303 0.99139628
Nothing particularly interesting. Now let’s look at the same directions after replacing the last observation with \((40, 40, 40)\):
p_df[nrow(p_df), ] = c(40, 40, 40)
prcomp(p_df)$rotation
## PC1 PC2 PC3
## X 0.6773874 -0.7283802 0.1029977
## Y 0.5310666 0.5810908 0.6166861
## Z 0.5090330 0.3630368 -0.7804420
We can see that the first principal direction has shifted markedly towards \((1, 1, 1)\), i.e. towards the outlier.
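We can quantify the alignment (reusing `p_df` with the outlier in place): the cosine of the angle between PC1 and \((1, 1, 1)\) should be close to \(1\).

```r
# Cosine of the angle between PC1 and (1, 1, 1):
# |<pc1, 1>| / (||pc1|| * ||1||), with ||pc1|| = 1 for prcomp output
pc1 = prcomp(p_df)$rotation[, 1]
abs(sum(pc1) / sqrt(3 * sum(pc1^2)))
```

Plugging in the loadings printed above gives roughly \(0.99\), i.e. PC1 is nearly parallel to the outlier direction.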
In this task we fit a principal component regression on the biopsy dataset (from MASS), predicting V2 from the remaining variables.
biopsy[, "ID"] = NULL     # drop the identifier column
biopsy[, "class"] = NULL  # drop the non-numeric class label
biopsy = biopsy[complete.cases(biopsy), ]  # keep only rows without NAs
model = pcr(V2 ~ ., data = biopsy, validation = "CV")
Let’s look at the summary:
summary(model)
## Data: X dimension: 683 8
## Y dimension: 683 1
## Fit method: svdpc
## Number of components considered: 8
##
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
## (Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps 8 comps
## CV 3.067 1.414 1.356 1.360 1.343 1.323 1.221 1.200 1.182
## adjCV 3.067 1.413 1.355 1.359 1.342 1.337 1.219 1.199 1.179
##
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps 8 comps
## X 67.04 74.99 81.96 86.88 90.92 94.67 97.48 100.0
## V2 78.83 80.55 80.58 81.05 81.75 84.83 85.48 86.1
So 3 principal components explain \(\geq 80\%\) of the predictors’ variance, while only 2 suffice to explain \(\geq 80\%\) of V2’s variance.
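Rather than reading the 80% threshold off the summary table, we can extract it from the fitted model; `explvar()` is a pls helper returning the percentage of X variance explained per component:

```r
# Cumulative % of predictor variance explained, and the smallest
# number of components reaching 80%:
cum_x = cumsum(explvar(model))
min(which(cum_x >= 80))  # 3, matching the summary above
```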
Let’s look at which variables contribute the most to the first principal component:
pca1 = model$loadings[, 1]
# subsetting a named vector keeps its names, so sorting it is enough
pca1[order(pca1)]
## V6 V3 V8 V4 V1 V7 V5 V9
## -0.4927208 -0.4202931 -0.3878608 -0.3635490 -0.3257396 -0.3182930 -0.2687934 -0.1353124
Regarding the coefficients, from what I understand the Y-loadings should be sufficient:
model$Yloadings
##
## Loadings:
## Comp 1 Comp 2 Comp 3 Comp 4 Comp 5 Comp 6 Comp 7 Comp 8
## V2 -0.423 0.182 0.121 0.162 -0.354 0.188 0.193
##
## Comp 1 Comp 2 Comp 3 Comp 4 Comp 5 Comp 6 Comp 7 Comp 8
## SS loadings 0.179 0.033 0.001 0.015 0.026 0.126 0.035 0.037
## Proportion Var 0.179 0.033 0.001 0.015 0.026 0.126 0.035 0.037
## Cumulative Var 0.179 0.212 0.213 0.228 0.254 0.379 0.415 0.452
So with one PC we’d have \(\beta = -0.423\), and with two \(\beta = [-0.423, 0.182]^T\); these are coefficients in the principal-component basis.
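If instead we want coefficients for the original predictors, `coef()` composes the X-loadings with the Y-loadings, i.e. \(\beta = P_{1:k}\,\gamma_{1:k}\); a sketch (the two results should agree up to rounding):

```r
# Coefficients in the original predictor space with k = 2 components:
coef(model, ncomp = 2)
# The same by hand: beta = loadings[, 1:k] %*% Yloadings[1:k]
model$loadings[, 1:2] %*% t(model$Yloadings)[1:2, , drop = FALSE]
```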
Now we shall check which number of PCs minimizes the cross-validated error (here MSEP) [apologies for the plot; I couldn’t figure out how to reproduce it in ggplot2]:
validationplot(model, val.type = "MSEP")
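pls can also select the number of components programmatically instead of by eyeballing the plot; `selectNcomp()`’s one-sigma rule prefers the smallest model within one standard error of the CV optimum:

```r
# Plain minimizer of the cross-validated RMSEP (intercept dropped):
which.min(RMSEP(model, estimate = "CV")$val[1, 1, -1])
# One-standard-error rule:
selectNcomp(model, method = "onesigma", plot = FALSE)
```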
Finally, we shall investigate whether the predictors most impactful on PC1 are the same as the ones selected by regsubsets:
rs = regsubsets(V2 ~ ., data = biopsy, nvmax = 8)
x = summary(rs)
x$which
## (Intercept) V1 V3 V4 V5 V6 V7 V8 V9
## 1 TRUE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE
## 2 TRUE FALSE TRUE FALSE TRUE FALSE FALSE FALSE FALSE
## 3 TRUE FALSE TRUE FALSE TRUE FALSE TRUE FALSE FALSE
## 4 TRUE FALSE TRUE TRUE TRUE FALSE TRUE FALSE FALSE
## 5 TRUE TRUE TRUE TRUE TRUE FALSE TRUE FALSE FALSE
## 6 TRUE TRUE TRUE TRUE TRUE FALSE TRUE TRUE FALSE
## 7 TRUE TRUE TRUE TRUE TRUE FALSE TRUE TRUE TRUE
## 8 TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
We can see that the variables enter in the order V3, V5, V7, V4, V1, V8, V9, with V6 last. So (unless I did the procedure incorrectly) this is not at all the ordering PCA suggested; in particular V6, the largest contributor to PC1, is selected last.
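The entry order can be read off programmatically from the `which` matrix: for each variable, find the smallest subset size containing it (this assumes the best subsets happen to be nested, which they are here):

```r
# Size at which each variable first enters the best subset:
w = summary(rs)$which[, -1]  # drop the intercept column
entry_size = apply(w, 2, function(col) min(which(col)))
names(sort(entry_size))      # selection order of the variables
```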